Assessing the efficiency of clustering algorithms and goodness-of-fit measures using phytoplankton field data
Abstract
Keywords: 2-norm; Cophenetic correlation coefficient; Beta diversity; Dendrogram; UPGMA; Ward's algorithm

Investigation of patterns in beta diversity has received increased attention in recent years, particularly in light of new ecological theories such as the metapopulation paradigm and metacommunity theory. Traditionally, beta diversity patterns are described by cluster analysis (i.e. dendrograms), which enables the classification of samples. Clustering algorithms define the structure of dendrograms; consequently, assessing their performance is crucial. A common, although not always appropriate, approach for assessing algorithm suitability is the cophenetic correlation coefficient c. Alternatively, the 2-norm has recently been proposed as an increasingly informative method for evaluating the distortion engendered by clustering algorithms. In the present work, the 2-norm is applied for the first time to field data and is compared with the cophenetic correlation coefficient using a set of 105 pairwise combinations of 7 clustering methods (e.g. UPGMA) and 15 (dis)similarity/distance indices (e.g. Jaccard index). In contrast to the 2-norm, the cophenetic correlation coefficient does not provide a clear indication of the efficiency of the clustering algorithms for all combinations. The two approaches did not always agree on the choice of the most faithful algorithm. Additionally, the 2-norm revealed that UPGMA is the most efficient clustering algorithm and Ward's the least. The present results suggest that goodness-of-fit measures such as the 2-norm should be applied prior to clustering analyses to obtain reliable beta diversity measures.

Enhancing our knowledge of the processes that shape variability in community structure (beta diversity) remains one of the fundamental challenges in contemporary community ecology (Condit et al., 2002; Gaston et al., 2007; Tuomisto et al., 2003).
To meet this challenge, applications of multivariate statistics in community ecology have expanded significantly during the last two decades (Dray and …). Among multivariate statistical methods, ordination and clustering are now routinely applied by ecologists to explore the spatial or temporal turnover of field communities (e.g. …). Among clustering methods, those based on a hierarchy of clusters have been used in many research fields (e.g. Torrente-Vilara et al., 2011). Hierarchical clustering analysis is subdivided into agglomerative and divisive methods, the former being the most common in ecological studies (Clarke and Warwick, 2001). All agglomerative procedures begin with an initial matrix (I), which is then transformed into an inter-object (inter-sample) matrix (D) using a relevant distance measure whose selection depends on the scientific question (see Fig. 1). …
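The workflow described above (initial matrix I, distance matrix D, dendrogram, goodness-of-fit) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical presence/absence matrix of four samples and five taxa, uses the Jaccard dissimilarity and UPGMA named in the abstract, and takes the 2-norm to be the Euclidean norm of the element-wise differences between D and the cophenetic (dendrogram-implied) distances; the excerpt does not give the formula, so that definition is an assumption. All function names are illustrative.

```python
import math

def jaccard_distance(a, b):
    """Jaccard dissimilarity between two presence/absence vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - shared / union

def distance_matrix(samples):
    """Build D as a dict {(i, j): distance} with i < j from the matrix I."""
    n = len(samples)
    return {(i, j): jaccard_distance(samples[i], samples[j])
            for i in range(n) for j in range(i + 1, n)}

def upgma_cophenetic(dist, n):
    """Run UPGMA (average linkage) and return the cophenetic distances."""
    clusters = {i: [i] for i in range(n)}  # cluster id -> member samples
    between = dict(dist)                   # distances between active clusters
    coph = {}
    next_id = n
    while len(clusters) > 1:
        # Merge the two closest clusters; the merge height is their distance.
        (a, b), height = min(between.items(), key=lambda kv: kv[1])
        left, right = clusters.pop(a), clusters.pop(b)
        for i in left:
            for j in right:
                coph[(min(i, j), max(i, j))] = height
        merged = left + right
        between = {k: v for k, v in between.items()
                   if a not in k and b not in k}
        # UPGMA: mean of all original distances between members.
        for c, members in clusters.items():
            total = sum(dist[(min(i, j), max(i, j))]
                        for i in merged for j in members)
            between[(c, next_id)] = total / (len(merged) * len(members))
        clusters[next_id] = merged
        next_id += 1
    return coph

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)

def goodness_of_fit(dist, coph):
    """Return (cophenetic correlation c, 2-norm of D minus cophenetic)."""
    pairs = sorted(dist)
    d = [dist[p] for p in pairs]
    u = [coph[p] for p in pairs]
    # Assumed definition: Euclidean norm of element-wise differences.
    two_norm = math.sqrt(sum((p - q) ** 2 for p, q in zip(d, u)))
    return pearson(d, u), two_norm

# Hypothetical matrix I: rows are samples, columns are taxa (1 = present).
samples = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
dist = distance_matrix(samples)
coph = upgma_cophenetic(dist, len(samples))
c, norm = goodness_of_fit(dist, coph)
print(f"cophenetic correlation c = {c:.4f}, 2-norm = {norm:.4f}")
```

Under these assumptions, a value of c close to 1 and a small 2-norm both indicate that the dendrogram distorts D only slightly. Comparing clustering algorithms as in the study would mean repeating the last step with other linkage rules and distance indices, which this sketch does not cover.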
Journal title:
- Ecological Informatics
Volume 9
Publication date: 2012